1. Introduction

Consider the following regression model

$y(t_i) = x_0(t_i) + \varepsilon_i,$ (1)

where the observation noise terms $\varepsilon_i$ are i.i.d. realizations of a certain random variable $\varepsilon$. The problem
we consider in this article is that of estimating $x_0$ based on a subsample of size $N \ll n$ of the
data collection

$(t_1, y_1), \ldots, (t_n, y_n).$

This occurs when, for example, obtaining the values of $y_i$ for each sample point $t_i$ is expensive
or time consuming, or because it is necessary to set up an experimental design based on previous
data.

Let $\hat{x}_N$ be the chosen estimator. Intuitively we would like that

$\|x_0 - \hat{x}_N\| \approx \|\hat{x}_n - x_0\|,$

where $\hat{x}_n$ is the "best" possible estimator in some sense over the whole data collection, with
$N$ small. That is, a good sample selection requires searching for the most informative, in some
sense, part of the sample.

In this article we propose a statistical regularization approach for selecting a good subsample of 
the data by introducing a weighted sampling scheme (importance weighting) and an appropriate 
penalty function over the sampling choices. This will be done by fixing a spanning family
and considering the best approximation $x_m$ of $x_0$ over the space it spans. In this way the problem of model
selection and choosing a good sampling set can be considered simultaneously. This is what is
known as active learning. We will consider two approaches. The first, the batch approach [10],
assumes the sampling set is chosen all at once, based on the minimization of a certain penalized 
loss function which can then be generalized to consider the problem of selecting the model 
as well. The second, the iterative approach [2], considers a two step iterative method choosing 
alternatively the best new point to be sampled and the best model given the set of points. In both 
cases, based on concentration-type inequalities we will show that the estimation schemes attain 
optimal rates while reducing the size of the sample. Although variance minimization techniques
for choosing appropriate subsamples are a widely used tool in practice, probability bounds
that allow active learning to attain optimal rates have been much less studied
in the regression setting.

The article is organized as follows. In section 2 we formulate the basic problem and study a 
batch approach for simultaneous sample and model selection. In section 3 we study an iterative 
approach to sample selection and we discuss effective sample size reduction. Section 4 is devoted 
to the proof of the more technical results. 

2. Preliminaries 

2.1. Formulation of the problem and basic assumptions

We are interested in recovering a certain approximation of $x_0$ based on observations

$y_i = x_0(t_i) + \varepsilon_i, \quad i = 1, \ldots, n,$

where $\varepsilon_i$ are i.i.d. realizations of a random variable $\varepsilon$ satisfying the moment condition
MC Assume the r.v. $\varepsilon$ satisfies $E\varepsilon = 0$, $E(|\varepsilon|^k/\sigma^k) < k!/2$ for all $k > 2$ and $E(\varepsilon^2) = \sigma^2$.
We also need some notation concerning the fixed design $t_i$, $i = 1, \ldots, n$. Let

$\delta_{t_i}(t) = 1$ if $t = t_i$, and $0$ if not.

Define the empirical measure

$P_n = \frac{1}{n}\sum_{i=1}^n \delta_{t_i},$

the associated empirical norm

$\|y\|_n^2 = \frac{1}{n}\sum_{i=1}^n (y(t_i))^2,$

and the empirical scalar product

$\langle y, u \rangle_n = \frac{1}{n}\sum_{i=1}^n u(t_i) y(t_i).$

With the above notation, given any positive function $r$, we also introduce the $r$-scalar product
$\langle y, u \rangle_{n,r} = \frac{1}{n}\sum_{i=1}^n r(t_i) u(t_i) y(t_i)$ and $\|y\|_{n,r}$ the associated empirical norm.
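
As a small numerical illustration of the empirical scalar product and its weighted version, the following sketch evaluates both on vectors of function values. The function names and the toy data are ours and purely illustrative.

```python
import numpy as np

def empirical_inner(y, u, r=None):
    """<y, u>_{n,r} = (1/n) * sum_i r(t_i) u(t_i) y(t_i); r=None gives <y, u>_n."""
    y = np.asarray(y, dtype=float)
    u = np.asarray(u, dtype=float)
    weights = np.ones_like(y) if r is None else np.asarray(r, dtype=float)
    return float(np.mean(weights * u * y))

def empirical_norm(y, r=None):
    """||y||_{n,r} = sqrt(<y, y>_{n,r})."""
    return float(np.sqrt(empirical_inner(y, y, r)))

# Toy values of y, u and a positive weight function r at n = 4 design points.
y = np.array([1.0, 2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 1.0, 0.0])
r = np.array([2.0, 2.0, 0.5, 0.5])

plain = empirical_inner(y, u)        # (1 + 3) / 4
weighted = empirical_inner(y, u, r)  # (2 * 1 + 0.5 * 3) / 4
norm_y = empirical_norm(y)           # sqrt((1 + 4 + 9 + 16) / 4)
```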
2.2. Discretization scheme

To start with we will consider the approximation of $x_0$ over a finite-dimensional subspace $S_m$.
This subspace will be assumed to be linearly spanned by the set $\{\phi_j\}_{j \in I_m} \subset \{\phi_j\}_{j \ge 1}$, with $I_m$
a certain index set.

We assume there exists a certain density $q$ such that

AQ There exists a positive constant $Q$ such that $q(t_i) < Q$, $i = 1, \ldots, n$ and

$\int \phi_l(u)\phi_k(u) q(u)\, du = \delta_{k,l}.$

We will also require the following assumption. 

AB There exists an increasing sequence $c_m$ such that $\|\phi_j\|_\infty < c_m$ for $j < m$.

Let $G_m = [\phi_j(t_i)]_{i,j}$ be the associated empirical $n \times m$ Gram matrix (design matrix), so that
$\frac{1}{n} G_m^t D_q G_m \to I_m$, where $D_q$ is the diagonal matrix with entries $q(t_i)$, for $i = 1, \ldots, n$.
We will assume the following approximation property for $S_m$.

AS There exist positive constants $\alpha$ and $c_1 < c_2$, such that

$c_1 n^{-1-\alpha} < \|I_m - \frac{1}{n} G_m^t D_q G_m\|_\rho < c_2 n^{-1-\alpha},$

where for any matrix $A$, $\|A\|_\rho$ stands for the usual spectral norm of $A$ in the $L_2$ norm.

We will denote by $\hat{x}_m \in S_m$ the function that minimizes the weighted norm $\|x - y\|_{n,q}^2$ over
$S_m$. That is,

$\hat{x}_m = \arg\min_{x \in S_m} \frac{1}{n}\sum_{i=1}^n (y_i - x(t_i))^2 q(t_i) = R_m y,$

with $R_m = G_m (G_m^t D_q G_m)^{-1} G_m^t D_q$ the orthogonal projector over $S_m$ in the $q$-empirical norm $\|\cdot\|_{n,q}$.

Let $x_m := R_m x_0$ be the projection of $x_0$ over $S_m$ in the $q$-empirical norm $\|\cdot\|_{n,q}$. Our goal is
to choose a good subsample of the data collection such that the estimator of the unobservable
function $x_0$ in the finite-dimensional subspace $S_m$, based on this subsample, attains optimal error
bounds. For this we must introduce the notion of subsampling scheme and importance weighted
approaches (see [2], [10]), which we discuss below.

2.3. Sampling scheme and importance weighting algorithm 

In order to sample the data set we will introduce a sampling probability $p(t)$ and a sequence of
Bernoulli($p(t_i)$) random variables $w_i$, $i = 1, \ldots, n$, independent of $\varepsilon_i$, with $p(t_i) > p_{\min}$. Let $D_{w,q,p}$
be the diagonal matrix with entries $q(t_i) w_i / p(t_i)$, so that $E(D_{w,q,p}) = D_q$. Sometimes it will be
more convenient to rewrite $w_i = 1_{u_i < p_i}$ for $\{u_i\}_i$ an i.i.d. sample of uniform random variables,
independent of $\{\varepsilon_i\}_i$, in order to stress the dependence on $p$ of the random variables $w_i$.
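
The construction of the sampling indicators from uniform variables, and the fact that the importance weights are correct on average, can be checked with a short simulation. This is only a sketch: the function names, the choice $q \equiv 1$ and the probabilities used below are illustrative assumptions.

```python
import numpy as np

def sampling_indicators(p, rng):
    """Draw w_i = 1{u_i < p_i} with u_i i.i.d. uniform on [0, 1)."""
    u = rng.uniform(size=len(p))
    return (u < p).astype(float)

def weight_diagonal(q, p, w):
    """Diagonal entries q(t_i) * w_i / p(t_i) of the importance-weight matrix."""
    return q * w / p

rng = np.random.default_rng(0)
q = np.ones(5)                              # assumed density values q(t_i)
p = np.array([0.2, 0.4, 0.6, 0.8, 1.0])     # assumed sampling probabilities

# Average the diagonal over many independent draws of w.
draws = np.array([weight_diagonal(q, p, sampling_indicators(p, rng))
                  for _ in range(20000)])
mean_diag = draws.mean(axis=0)              # close to q, entry by entry
```

Points with small $p(t_i)$ are rarely sampled but receive a large weight $1/p(t_i)$ when they are, which is what keeps the average equal to $q(t_i)$.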

The next step is to construct an estimator for $x_m = R_m x_0$, based on the observation vector $y$
and the sampling scheme $p$. For this, we consider a modified version of the estimator $\hat{x}_m = R_m y$.

As the approximation of $x_m$, we then take (for a fixed $m$ and $p$)

$\hat{x}_{m,p} = \arg\min_{x \in S_m} \|x - y\|_{n,wq/p}^2 = \arg\min_{x \in S_m} \frac{1}{n}\sum_{i=1}^n (y_i - x(t_i))^2 \frac{w_i}{p(t_i)} q(t_i),$

that is,

$\hat{x}_{m,p} = R_{m,p}\, y,$ (2)

for $R_{m,p} = G_m (G_m^t D_{w,q,p} G_m)^{-1} G_m^t D_{w,q,p}$ the orthogonal projector over $S_m$ in the $wq/p$-empirical
norm $\|\cdot\|_{n,wq/p}$. Note that this estimator depends on $y_i$ only if $w_i = 1$.
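
A minimal sketch of the importance-weighted least squares estimator follows. The polynomial basis, the target function and the constant sampling probability are hypothetical choices made only for the illustration; the last lines check numerically that labels with $w_i = 0$ do not influence the fit.

```python
import numpy as np

def weighted_ls_estimator(G, y, q, p, w):
    """Fitted values G (G^T D G)^{-1} G^T D y with D = diag(q_i * w_i / p_i).
    Labels with w_i = 0 receive zero weight."""
    d = q * w / p
    GtD = G.T * d                       # G^T D without forming the n x n matrix
    coef = np.linalg.solve(GtD @ G, GtD @ y)
    return G @ coef

rng = np.random.default_rng(1)
n, m = 200, 3
t = np.linspace(0.0, 1.0, n)
G = np.vander(t, m, increasing=True)    # hypothetical basis 1, t, t^2
x0 = 1.0 + 2.0 * t - t**2               # hypothetical target, inside the span
y = x0 + 0.1 * rng.standard_normal(n)
q = np.ones(n)
p = np.full(n, 0.5)
w = (rng.uniform(size=n) < p).astype(float)

fit = weighted_ls_estimator(G, y, q, p, w)

# Corrupting the unsampled labels leaves the fit unchanged.
y_corrupt = np.where(w == 1.0, y, 1e6)
fit_corrupt = weighted_ls_estimator(G, y_corrupt, q, p, w)
```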



2.4. Choosing a good sampling scheme (for a fixed model)

Here, we will assume that $S_m$ is fixed with dimension $|I_m| = m$. In this case we will assume
that the bias $\|x_0 - x_m\|_{n,q}^2$, which is independent of $p$, is known up to a constant (for example
based on approximation errors over a fixed model space) and study rather the approximation
error $\|x_m - \hat{x}_{m,p}\|_{n,q}$. The latter depends on how the sampling probability $p$ is chosen.

Let $\mathcal{P} := \{p_k : \{t_1, \ldots, t_n\} \to [0,1]^n,\ k \ge 1\}$ be a countable collection of $[0,1]$-valued
functions. We will assume that $\min_k \min_i p_k(t_i) > p_{\min}$.

A good sampling scheme $\hat{p}$, based on the data, should be the minimizer of the non-observable
quantity $\|x_m - \hat{x}_{m,p}\|_{n,q}^2$. To overcome this difficulty, we observe that since $R_{m,p} x_m = x_m$, then

$\hat{x}_{m,p} - x_m = R_{m,p}[x_0 - x_m] + R_{m,p}\varepsilon$

$= E(R_{m,p})[x_0 - x_m] + (R_{m,p} - E(R_{m,p}))[x_0 - x_m] + R_{m,p}\varepsilon.$ (3)

Consider the deterministic term $E(R_{m,p})[x_0 - x_m]$. We shall prove in Lemma 2.4.5 that under
condition [AS], the term $\|E(R_{m,p})[x_0 - x_m]\|_{n,q}$ is of order $O(n^{-1-\alpha}\|x_0 - x_m\|_{n,q}/p_{\min})$. Whence,
any minimizer should essentially account for the biggest possible values, with high probability,
of the second and third terms. It is thus reasonable to consider the best $p$ as the minimizer

$\hat{p} = \arg\min_{p \in \mathcal{P}} \mathrm{pen}(m, p, \delta, \gamma, n),$ (4)
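where, for a given $0 < \gamma < 1$,

$\mathrm{pen}(m,p,\delta,\gamma,n) = (1+\gamma)\,\mathrm{pen}_1(m,p,\delta) + (1 + 1/\gamma)\,\mathrm{pen}_2(m,p,\delta),$

with $\mathrm{pen}_1$ and $\mathrm{pen}_2$, which will be defined below, such that

$P(\sup_{p \in \mathcal{P}} \{\|(R_{m,p} - E(R_{m,p}))[x_0 - x_m]\|_{n,q}^2 - \mathrm{pen}_1(m,p,\delta)\} > 0) < \delta/2,$

$P(\sup_{p \in \mathcal{P}} \{\|R_{m,p}\varepsilon\|_{n,q}^2 - \mathrm{pen}_2(m,p,\delta)\} > 0) < \delta/2.$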

The last two inequalities will be examined separately in Lemma 2.4.1 and Lemma 2.4.3. These 
Lemmas together with Lemma 2.4.5 and the definition of the penalization terms assure that the 
proposed estimation procedure is not only consistent but that it achieves optimal rates. 

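For each $p \in \mathcal{P}$, let $k(p)$ be its corresponding index and define

$\mathrm{pen}_1(m,p,\delta) = \|x_0 - x_m\|_{n,q}^2\, \beta_{m,k(p)} (1 + \beta_{m,k(p)}^{1/2})^2,$ (5)

with

$\beta_{m,k(p)} = \frac{c_m(\sqrt{17}+1)}{2} \sqrt{\frac{mQ}{n p_{\min}}} \sqrt{2\log(2^{7/4} m\, k(p)(k(p)+1)/\delta)},$ (6)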
and finally,



with $r > 1$, $d = d(r)$ a positive constant that depends on $r$ and $L_k = L_{k(p)} > 0$ a sequence such
that the following Kraft condition $\sum_k e^{-\sqrt{d r L_k (m+1)}} < 1$ holds.
We have the following result.

Lemma 2.4.1. Assume that the conditions [AB], [AS], and [AQ] are satisfied and that there is
a constant $p_{\min} > 0$ such that for all $i = 1, \ldots, n$, $p(t_i) > p_{\min}$. Assume $\mathrm{pen}_1$ to be selected
according to (5). Then for all $\delta > 0$ we have

$P(\sup_{p \in \mathcal{P}} \{\|(R_{m,p} - E(R_{m,p}))[x_0 - x_m]\|_{n,q}^2 - \mathrm{pen}_1(m,p,\delta)\} > 0) < \delta/2.$



Proof. We will achieve the proof by bounding

$\|(R_{m,p} - E(R_{m,p}))[x_0 - x_m]\|_{n,q}^2 < \|R_{m,p} - E(R_{m,p})\|_\rho^2\, \|x_0 - x_m\|_{n,q}^2.$

For this we shall consider a double application of a straightforward generalization of Theorem
7.3 in [8], whose proof is given in the Appendix.

Lemma 2.4.2. Let $A \in \mathbb{R}^{n \times m}$ be some matrix whose rows, $a(l) \in \mathbb{R}^m$, $l = 1, \ldots, n$, satisfy
$\|a(l)\|_2 \le K\sqrt{m}$ for some constant $K \ge 1$. Consider the matrix $A^t A = \sum_{l=1}^n a(l) a(l)^t$ and
let $K_A = \frac{1}{n}\|E(A^t A)\|_\rho$. Set $\tau = (\sqrt{17} + 1)/4$. We have the following bounds:



• Define Er:=E{\\:;^^{A^ A -^{A^A))^^ and let 



\^/nKA V n 
Then for any r > 2, 

Let 5 < 1/2, then the following bound in probability holds true for u > \pl 



or equivalently with probability at least I — S 



With this lemma we continue the proof of Lemma 2.4.1. Recall that $R_{m,p} = \frac{1}{n} G_m (\frac{1}{n} G_m^t D_{w,q,p} G_m)^{-1} G_m^t D_{w,q,p}$.
On the other hand, observe that since $A_{m,p} := \frac{1}{n} G_m^t D_{w,q,p} G_m$ is a positive definite matrix
its inverse exists and moreover we may write

$A_{m,p}^{-1} = A_{m,p}^{-1/2} A_{m,p}^{-1/2}$

using the standard spectral notation. Also since $A_{m,p}$ is symmetric we have $A_{m,p}^{-1/2} = (A_{m,p}^{-1/2})^t$.
Consider the matrix Am.p — DpqwGm- Then we have its rows dm.p{l) satisfy



^m,p[i-)\\2 



\ 



and = ||E pAm,p^ \\p < 1 + C2(n) " ^ under assumption [AS]. 

In what follows set $k = k(p)$ and let $\delta_k = \delta/(2k(k+1))$. A first application of Lemma 2.4.2
then yields

$\|A_{m,p} - E(A_{m,p})\|_\rho < 2\tau c_m \sqrt{1 + c_2 n^{-1-\alpha}} \sqrt{\frac{mQ}{n p_{\min}}} \sqrt{2\log(2^{3/4} m/\delta_k)}$

with probability greater than $1 - \delta_k$. Here, the choice of $\delta_k$ is required in order to account for the
supremum over the collection of possible sampling schemes.

It then follows using a classical Neumann series expansion that with probability greater than 



where 



\A;n\p\\p < ^ , , =~ (8) 



np 

Now, consider the matrix Em.p = AmJ:,'^Gl^Dpqw and note that the projection matrix Rm,p = 
^GmA^^^pG^Dpqw = ^Elj^pEm,p- Using the singular value decomposition and the definition of 
Em,p, we have 

\\-^m,p-^m,p ^ E (^Ej^ pEjyi.pj lip = \\Em,pE^^ p — E (^Ef„,pE^ p J lip, 

since the singular values are the same. Thus, it is enough for our purposes to bound \\E„i^pE'^^ p — 

E (^E,n,pE^^ p^ lip in probability. 

Next, we bound the rows of matrix E^p, e^p(/). As before and using the bound in (8) we 
have 

mQ 



V Pmin 

On the other hand, because $R_{m,p}$ is a projection matrix, $\|R_{m,p}\|_\rho = 1$ and we have

$1 = E(\|R_{m,p}\|_\rho) \ge \sup_{\|u\|_2 = 1} E(\|R_{m,p} u\|_2) \ge \sup_{\|u\|_2 = 1} \|E(R_{m,p}) u\|_2 = \|E(R_{m,p})\|_\rho$

so that Apf < 1. 

Then Lemma 2.4.2 yields the stated result by the choice of the penalization $\mathrm{pen}_1(m,p,\delta)$ and
taking a union bound over $p \in \mathcal{P}$.

□ 
Lemma 2.4.3. Assume the observation noise in equation (1) is an i.i.d. collection of random
variables satisfying the moment condition [MC]. Assume that the condition [AQ] is satisfied and
assume that there is a constant $p_{\min} > 0$ such that $p(t_i) > p_{\min}$ for all $i = 1, \ldots, n$. Assume
$\mathrm{pen}_2$ to be selected according to (7) with $r > 1$, $d = d(r)$ and $L_k > 0$, such that the following
Kraft inequality $\sum_k e^{-\sqrt{d r L_k (m+1)}} < 1$ holds. Then,

$P(\sup_{p \in \mathcal{P}} \{\|R_{m,p}\varepsilon\|_{n,q}^2 - \mathrm{pen}_2(m,p,\delta)\} > 0) < \delta/2.$

Proof. For a given positive function /. recall ^ l/^^X^ILi /(^j)""!^*)- Let u G Sm and 

consider a linear application A : i?" i?™. Define rj{A,z) := \J z^A^Az for z S i?" and 
r//(A, z) := sup||„||^^^i YTi^i fiti)zi{A*u)i. Then, ///(A, z) = ri{ADy^,z), where Df is the diag- 
onal matrix with entries fi. 

On the other hand, note that $\|R_{m,p}\varepsilon\|_{n,q} = \eta(R_{m,p} D_q^{1/2}, \varepsilon)$. The proof then follows directly
from the next lemma, whose proof is contained in [7].

Lemma 2.4.4. Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^t$ be a vector of i.i.d. random variables satisfying the moment
condition [MC]. Let $A$ be a given $m \times n$ matrix. Define $\eta(A) = \eta(A, \varepsilon) = \sqrt{\varepsilon^t A^t A \varepsilon}$. Then, for
$r > 1$, $u > 0$ and $L > 0$ there exists a positive constant $d$ that depends on $r$ such that the following
inequality holds

$P(\eta^2(A) > \sigma^2[\mathrm{Tr}(A^t A) + \rho(A^t A)]\, r(1 + L) + \sigma^2 u)$

$< \exp\{-\sqrt{d\,(u/\rho(A^t A) + rL[\mathrm{Tr}(A^t A)/\rho(A^t A) + 1])}\}.$

To apply Lemma 2.4.4, we have to study the trace and the spectral radius of the matrix
$\Gamma = (R_{m,p} D_q^{1/2})^t R_{m,p} D_q^{1/2}$. But, as $R_{m,p}$ is a projection operator, then $\mathrm{Tr}(\Gamma) < Qm$ and the
spectral radius $\rho(\Gamma) < Q$.

Thus, we have

$P(\sup_{p \in \mathcal{P}} \{\|R_{m,p}\varepsilon\|_{n,q}^2 - \mathrm{pen}_2(m,p,\delta)\} > 0) < \delta/2 \times \sum_k \exp\{-\sqrt{d r L_k [m+1]}\},$

which yields the desired result. □
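
To make the role of the Kraft condition concrete, the following sketch evaluates $\sum_k \exp\{-\sqrt{d r L_k (m+1)}\}$ for one admissible weight sequence. The choice $L_k = \log^2(2k(k+1))/(d r (m+1))$ is ours, not the paper's: it makes the $k$-th term equal to $1/(2k(k+1))$, so the full series sums to $1/2$. The values of $m$, $d$ and $r$ are arbitrary.

```python
import math

def kraft_weights(num_schemes, m, d, r):
    """Hypothetical choice L_k = log(2k(k+1))^2 / (d r (m+1)), k = 1..num_schemes."""
    return [math.log(2 * k * (k + 1)) ** 2 / (d * r * (m + 1))
            for k in range(1, num_schemes + 1)]

def kraft_sum(weights, m, d, r):
    """Evaluate sum_k exp(-sqrt(d r L_k (m+1)))."""
    return sum(math.exp(-math.sqrt(d * r * L * (m + 1))) for L in weights)

weights = kraft_weights(num_schemes=1000, m=5, d=2.0, r=1.5)
total = kraft_sum(weights, m=5, d=2.0, r=1.5)
# Telescoping: sum_{k=1}^{K} 1 / (2k(k+1)) = (1/2) * (1 - 1/(K+1)) < 1.
```

Any sequence $L_k$ growing at least this fast keeps the union bound over the collection of sampling schemes below $\delta/2$.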

The next lemma controls the bias term.
Lemma 2.4.5. Under condition [AS], if $m = o(n)$ and $p_{\min}^{-1} = o(n)$, then

$\|E(R_{m,p})[x_0 - x_m]\|_{n,q} = O(n^{-1-\alpha}\|x_0 - x_m\|_{n,q}/p_{\min}).$

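Proof. Recall from Lemma 2.4.1 that $A_{m,p} = \frac{1}{n} G_m^t D_{w,q,p} G_m$ and set $A_m = E(A_{m,p}) = \frac{1}{n} G_m^t D_q G_m$.
Then $R_{m,p} = \frac{1}{n} G_m A_{m,p}^{-1} G_m^t D_{w,q,p}$ and

$R_m = E(\frac{1}{n} G_m A_m^{-1} G_m^t D_{w,q,p}) = \frac{1}{n} G_m A_m^{-1} G_m^t D_q.$

Remark that under condition [AS], $\|A_m - I\|_\rho < c_2 n^{-1-\alpha}$.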
Set Qm,p = Am,p - /, SO that ||E {Qm,p) lip < ^ Set Ka := Cm^/Q/pmin- By Lemma
2.4.2 we have, for any r > 2, 

:= [E(||A„,,p-E(A™,p)||ni/'^ 

^ /2^ /7^y^23/w/2,-/2. 

where/l= ||E(An.p)||p = 0(l). 

Whence, for big enough r, since m = o(n) and = o(n), we have Em^p^r = 0{n~^~") and 
thus by Holder's inequahty that E (||Qm.p||p) < 0{n~^^"). The latter yields that, in particular, 
E (IIQm.pllp) < 1- Set Cm,p := ^m!p ~ Usiug the classical Neumann expansion, under condition 
[AS], by the Monotone Converge Theorem we may finally bound E (||Cm,p||p) < cn~-^~°' for a 
certain positive constant c. We also have = I + C,n with ||Cm||p < n~^~". 

On the other hand remark, from the definition of the spectral norm, that for any matrix B, 
||i?||p = II-B'^IIp = i/ll-B * i?-^||p, so that for any given matrix M, 

||M||p||l/nG,„GL||p||D 

< llMllplllxo - a;,„]||„,,/pmin, 

where the last bound follows from the definition of the diagonal matrix Dw,q,p and the bounds 
on ||l/nGmG^||p under condition [AS]. 

Then, since by definition Rm [xq — Xm] = we have 



|[E (-Rm,p) [-^0 ^m]||n,g 

= ||E (l/nG,„[/ + G™,p + A;^^ - A„^]G^A«,g,p) [xo - a;™]||„,g 

— {^/nGmCm.pGl^D^^q^p^ [xq — Xm]||„^g 

+ ||l/nG„[/~4-i]GLE {D 

w,q,p) [-^O ^77i\\\n.q 

<E(||a„,p||p||l/nG„G^||p||D 

i(;,g,p||p) 11^0 ^m||n,g 

+ ||/- 4-i||p||l/nG™G*J|p||E {D.^.q.p) \\p\\xo - x-„,||„,, 
< c[ h n"^""]||a;o - x„i\\n,q, 

Pmin 

where, for the line before last we have used ||E(i3)||p < E(||i3||p) for any given matrix B, and 
the last line follows from the above discussion. 

□ 

Lemmas 2.4.1 and 2.4.3 yield the following consistency result. 

Theorem 2.4.6. Assume that the conditions [AB], [AS] and [AQ] are satisfied and that we use
a sampling strategy $\hat{p}$ satisfying

$\hat{p} = \arg\min_{p \in \mathcal{P}} \{(1+\gamma)\,\mathrm{pen}_1(m,p,\delta) + (1 + 1/\gamma)\,\mathrm{pen}_2(m,p,\delta)\}$

with $\mathrm{pen}_1$ and $\mathrm{pen}_2$ defined in (5) and (7) for $0 < \gamma < 1$. Then the following inequality holds
with probability greater than $1 - \delta$



$\|x_m - \hat{x}_{m,\hat{p}}\|_{n,q}^2 < 6(\|E(R_{m,p})(x_m - x_0)\|_{n,q}^2$

$+ (1+\gamma)\,\mathrm{pen}_1(m,p,\delta) + (1 + 1/\gamma)\,\mathrm{pen}_2(m,p,\delta)).$
Proof. For any $p \in \mathcal{P}$,

+ {pen(TO,p, (5, 7, n) - ||a;„i - x^.pWn^q] 

where pen{m.,p, 5, 7, n) = (l+7)peri]^(m,p, J) + (1 + 1/ j)pen2{rn,p, S) with penj^ and pen2 defined 
in (5) and (7). 

On the other hand, recall that (see equation (3)), 

Since for any $0 < \gamma < 1$, $2ab \le \gamma a^2 + \frac{1}{\gamma} b^2$ holds for all $a, b \in \mathbb{R}$, following standard arguments
we have

ll^^rn ~ ^m,p\\n,q 

< 2\\E {Rm,p) [xo - x,„||2 + 2(1 +7)11 - E (i?,„,p))[a;o - a;,„]f 

+2(l + l/7)||i?™,p£||2_^. 

Thus, 

||2;m ~ ^m,j5|ln.(} 



<6||E(i?„,p)(x™-xo)||,',,, 

+6(1 + 7)perii(m,p, 5) + 6(1 + l/j)pen2{m,p, 6) 
+6(l + 7)(sup{||i?m,p(x — Xq) — E {Rm,p) ( 

—peni{m,p, S) }) 

+6(1 + 7"^)(sup{||i?™.pe||,2 -^2(™>-P. '5)}) 
•p 

Finally, as follows from Lemma 2.4.1 and 2.4.3, with probability greater than 1 — S, we have the 
stated result. □ 



2.5. Model selection and active learning 



Given a model and $n$ observations $\{x_i, y_i\}_{i=1}^n$ we know how to estimate the best sampling scheme
$\hat{p}$ and to obtain the estimator $\hat{x}_{m,\hat{p}}$. The problem is that the model $m$ might not be a good one.
Instead of just looking at fixed $m$ we would like to consider simultaneous model selection as in
[10]. For this we shall pursue a more global approach based on loss functions.

We start by introducing some notation. Set $l(u,v) = (u - v)^2$ the squared loss and let
$L_n(x,y,p) = \frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} l(x(t_i), y_i)$ be the empirical loss function for the quadratic difference
with the given sampling distribution. Set $L(x) := E(L_n(x,y,p))$ with the expectation taken over
all the random variables involved. Let $L_n(x,p) := E_\varepsilon(L_n(x,y,p))$, where $E_\varepsilon()$ stands for the
conditional expectation given the sample $w$, that is, the expectation with respect to the random
noise. It is not hard to see that

$L(x) = \frac{1}{n}\sum_{i=1}^n q_i E(l(x(t_i), y_i)),$

and

$L_n(x,p) = \frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} E(l(x(t_i), y_i)).$
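
The effect of the weights $w_i/p_i$ in the empirical loss can be illustrated numerically: holding the labels fixed and averaging over the sampling indicators alone recovers the full-sample $q$-weighted loss. The sketch below uses the squared loss; the candidate function, noise level and probabilities are hypothetical.

```python
import numpy as np

def weighted_empirical_loss(x_vals, y, q, p, w):
    """(1/n) * sum_i q_i (w_i / p_i) (x(t_i) - y_i)^2."""
    return float(np.mean(q * (w / p) * (x_vals - y) ** 2))

rng = np.random.default_rng(2)
n = 50
t = np.linspace(0.0, 1.0, n)
x_vals = np.sin(2 * np.pi * t)                 # hypothetical candidate function
y = x_vals + 0.3 * rng.standard_normal(n)      # one fixed noise realization
q = np.ones(n)
p = np.linspace(0.3, 0.9, n)                   # assumed sampling probabilities

# Loss when every label is used (w_i = p_i = 1).
full_loss = weighted_empirical_loss(x_vals, y, q, np.ones(n), np.ones(n))

# Average of the subsampled loss over many independent draws of w.
losses = [weighted_empirical_loss(x_vals, y, q, p,
                                  (rng.uniform(size=n) < p).astype(float))
          for _ in range(20000)]
avg_subsampled = float(np.mean(losses))
```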



Recall that $\hat{x}_{m,p} = R_{m,p}\, y$ is the minimizer of $L_n(x,y,p)$ over each $S_m$ for given $p$ and that
$x_m = R_m x_0$ is the minimizer of $L(x)$ over $S_m$. Our problem is then to find the best approximation
of the target $x_0$ over the function space $S_0 := \bigcup_{m \in I} S_m$. In the notation of section 2.2 we assume
for each $m$ that $S_m$ is a bounded subset of the linearly spanned space of the collection $\{\phi_j\}_{j \in I_m}$
with $|I_m| = d_m$.

Unlike the fixed $m$ setting, model selection requires controlling not only the variance term
$\|x_m - \hat{x}_{m,p}\|_{n,q}$ but also the unobservable bias term $\|x_0 - x_m\|_{n,q}$ for each possible model $S_m$.
If all samples were available this would be readily available just by looking at $L_n(x,y,p)$ for all
$S_m$ and $p$, but in the active learning setting labels are expensive.

Set $e_m := \|x_0 - x_m\|_\infty$. In what follows we will assume that there exists a positive constant $C$
such that $\sup_m e_m \le C$. Remark this implies $\sup_m \|x_0 - x_m\|_{n,q} < QC$, with $Q$ defined in [AQ].

Recalling that $p_k \in \mathcal{P}$, where $\mathcal{P}$ stands for the set of candidate sampling probabilities, set $p_{k,\min} = \min_i(p_{k,i})$.

Define 

Pfeanin V 2n 6 

pen,{m, Pk,6) ^ QC/3^,fc(l + fS'Jlf, (10) 

with 



and finally 



Cm{VT7+l) / dmQ L. ,3*27/4d2^(d,„ + l)fc(fc + l) 

Pm,fc = 7; \ V ^iOg( X 

2 V npk,min V 5 



pen,{m, Pk,S) = Qa' { r{l + Lm.k)^^ + (11) 

n an ) 



where $L_{m,k} \ge 0$ is a sequence such that $\sum_{m,k} e^{-\sqrt{d r L_{m,k}(d_m + 1)}} < 1$ holds. We remark that
the change from $\delta$ to $\delta/(d_m(d_m + 1))$ in $\mathrm{pen}_0$ and $\mathrm{pen}_1$ is required in order to account for the
supremum over the collection of possible model spaces $S_m$.

Also, we remark that introducing simultaneous model and sample selection results in the
inclusion of the term $\mathrm{pen}_0 \sim 1/p_{k,\min} \cdot \sqrt{1/n}$, which includes an $L_\infty$ type bound instead of an $L_2$
type norm, and this may yield non-optimal bounds. Dealing more efficiently with this term would
require knowing the (unobservable) bias term $\|x_0 - x_m\|_{n,q}$. A reasonable strategy is selecting
$p_{k,\min} = p_{k,\min}(m) \ge \|x_0 - x_m\|_{n,q}$ whenever this information is available. In practice, $p_{k,\min}$
can be estimated for each model $m$ using a previously estimated empirical error over a subsample
if this is possible. However, this yields a conservative choice of the bound. One way to avoid this
inconvenience is to consider iterative procedures, which update on the unobservable bias term.
This course of action shall be pursued in section 3.

With these definitions, for a given < 7 < 1 set 

pen{m,p, S, 7, n) 

= [^Po{m,p,5) + {— h -)peni{m,p,5) 

Pmin 7 

+ (^(- + 1) + -)pen2(m,p,5) + 2((c+ 1)^^^^;1^)2]. 



Pmin 7 7 P' 



min 
and define

$L_{n,1}(x,y,p) = L_n(x,y,p) + \mathrm{pen}(m,p,\delta,\gamma,n).$

The appropriate choice of an optimal sampling scheme simultaneously with that of model 
selection is a difficult problem. We would like to choose simultaneously m and p, based on the 
data in such a way that optimal rates are maintained. We propose for this a penalized version 
of $\hat{x}_{m,p}$, defined as follows.

We start by choosing, for each $m$, the best sampling scheme

$\hat{p}(m) = \arg\min_{p} \mathrm{pen}(m,p,\delta,\gamma,n),$ (12)

computable before observing the output values $\{y_i\}$, and then calculate the estimator $\hat{x}_{m,\hat{p}(m)} = R_{m,\hat{p}(m)}\, y$, which was defined in (2).
Finally, choose the best model as

$\hat{m} = \arg\min_{m} L_{n,1}(\hat{x}_{m,\hat{p}(m)}, y, \hat{p}(m)).$ (13)

The penalized estimator is then $\hat{x}_{\hat{m}} := \hat{x}_{\hat{m},\hat{p}(\hat{m})}$. It is important to remark that for each model
$m$, $\hat{p}(m)$ is independent of $y$ and hence of the random observation error structure. The following
result ensures the consistency of the proposed estimation procedure, although the obtained rates
are not optimal as observed at the beginning of this section.
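
The two-stage selection in (12) and (13) can be sketched as follows. This is only an illustration under strong simplifications: the penalty `toy_pen` is a hypothetical stand-in for $\mathrm{pen}(m,p,\delta,\gamma,n)$ (it merely grows with $m$ and shrinks with $n$ and $\min_i p_i$), the models are polynomial spaces, and the collection of sampling schemes has two elements.

```python
import numpy as np

def fit_values(G, y, d):
    """Fitted values G (G^T D G)^{-1} G^T D y with D = diag(d)."""
    GtD = G.T * d
    return G @ np.linalg.solve(GtD @ G, GtD @ y)

def select_model_and_scheme(designs, schemes, y, q, u, pen):
    """designs: dict m -> (n x m) design matrix; schemes: list of probability
    vectors; u: uniforms defining w_i = 1{u_i < p_i}; pen(m, p) is a
    user-supplied stand-in for the penalty of the text."""
    best = None
    for m, G in designs.items():
        # As in (12): the scheme is chosen from the penalty alone, without labels.
        p = min(schemes, key=lambda pk: pen(m, pk))
        d = q * (u < p) / p
        x_hat = fit_values(G, y, d)
        # As in (13): importance-weighted empirical loss plus penalty.
        score = float(np.mean(d * (x_hat - y) ** 2)) + pen(m, p)
        if best is None or score < best[0]:
            best = (score, m, p, x_hat)
    return best

rng = np.random.default_rng(4)
n = 300
t = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * t - 3.0 * t**2 + 0.1 * rng.standard_normal(n)
q = np.ones(n)
u = rng.uniform(size=n)
designs = {m: np.vander(t, m, increasing=True) for m in (1, 2, 3, 4)}
schemes = [np.full(n, 0.5), np.where(t < 0.5, 0.2, 0.8)]

def toy_pen(m, p):
    # Hypothetical penalty, not the one defined in the text.
    return 0.05 * m / (n * p.min())

score, m_hat, p_hat, x_hat = select_model_and_scheme(designs, schemes, y, q, u, toy_pen)
```

Because the penalty depends on $p$ only through $\min_i p_i$ in this toy version, the uniform scheme is preferred over the non-uniform one with the same expected budget; the labels $y$ enter only in the second stage.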

Theorem 2.5.1. With probability greater than 1 — 5, we have 

1 — 47 

+ min{2po{m,Pk,S) H —peni{m,Pk,S) 

ra,k Pmin 

(l + 2/7)pen2(TO,Pfc,5))] 



p2 . 

^min 



1+7 

< — m.m[L{xm) + minpen(TO, Pfc, 5, 7, n)] 

1 — 47 rn. k 

Proof. The proof follows from Lemma 2.5.2 below. □

In order to state Lemma 2.5.2 we introduce, for any given $p$ and $x \in S_m$, $x' \in S_{m'}$, the quantities

$\Delta_1(x,x',p) := [L_n(x,y,p) - L_n(x,p)] - [L_n(x',y,p) - L_n(x',p)]$

and

$\Delta_2(x,x',p) := [L_n(x,p) - L(x)] - [L_n(x',p) - L(x')].$

We then have:

Lemma 2.5.2. Let $\varepsilon$ be a vector of i.i.d. random variables satisfying the moment condition [MC].
Assume that the conditions [AB], [AS] and [AQ] are satisfied. Assume that $p_k(t_i) > p_{k,\min}$ for
all $i = 1, \ldots, n$ and $L_{m,k} \ge 0$, such that the following Kraft inequality $\sum_{m,k} e^{-\sqrt{d r L_{m,k}(d_m + 1)}} < 1$
holds. Assume $\mathrm{pen}_0$, $\mathrm{pen}_1$ and $\mathrm{pen}_2$ to be selected according to (9), (10) and (11) respectively.
Let $x \in S_m$ and $x' \in S_{m'}$. Then

P{ sup {Ai{x,x',Pk) 

m.rn' ,k 

- [ 2 b^"2 {m,Pk,5)+ penl (m', Pk , 6)) 

TPmin 

+l{\\xo-x\\l^g + \\xo-x'\\l^)])>0) 
< S/3 
and

P{ sup {A2(i,„,p,x',Pfc) 

m.ni' ,k 

-[2po{m,Pk,S) + {— 1- -)peni{m,Pk,S) 

Pmin 7 

+2pa{m' ,Pk,S) H peni{m\Pk,S) 

Pmin 

+3711x0 - x,n\\l + (^— (- + 1) + -)pen2{m',Pk,S) 

Pmin 7 7 

+2((c+i) "'''^°^^^ n}>o) 

< 25/3 

Proof. In order to simplify notation throughout the proof of the lemma we use $p$ instead of $p_k$.
The first part of Lemma 2.5.2 is rather standard, the only care being to take into account the
random norm. For any $x \in S_m$ and $x' \in S_{m'}$ we have

\Aiix,x',p)\ 



2 " w- 

- qi—ei{x - x'){U)\ 



2 " w- 

- \-y^^1i—^iixQ ~ X){ti)\ 



2 " w 
+ |- V'gi— ei(a;o - 
i=i ^' 

1 " w 

< 2||(a;o - a;)||„,,^,/j, sup -^qi—eiV^ 

n 

11 , /Ml t \ ^ 

+ 2||(xo - X )||„,gj„/p sup -/^li ^i'"i 

\\vU.qui/p=i-ves^, 

< 7(l!a;o-a::|i^,q + ||a;o-a:^'lll5) 

"'"TZl (ll^™:P^II?i,gtu/p + ll-^m',P^llri,g-u;/p)i 
'Pmin 

where we have used \\Rm,p£\\n,qw/p = s^P|lt)|l„,,„/p=i „ I]"=i 9iff ^i^* ^'iid inequality 2ab < 
70^ + 1/7^^ to obtain the stated result. 
Hence, 

P„,( sup [Al 

m.rn' .p 

2 (pen2(m,p, S) +pen2(m',p, 5)) 

-7(11X0 - X\\l g + \\X0 - x'Wl g)] > 0) 

< 2j2Pwi\\Rm,pe\\l^g^fp > pen2{m,p,S)) 

< S/3 (14) 
by Lemma 2.4.4 and the choice of the penalization $\mathrm{pen}_2$ in (11), recalling that $R_{m,p}$ is a projection
matrix.

The term A2 requires a little more work. To begin with, for any x S 5^, write Ln{x,p) := 



a;-a;o|| 



,qw/p 



and L{x) := \\x — xoW^ q. Recall that, for a given m, Xm,p — xq = Rm,p{xo ■ 



{xm — xo) + Rm,p£- To deal with this term, we must consider all the terms in the square of this 
expression. Thus, 

~ ~ 1 " m- 



n ^-^ 

1 " 

n 

i—i 
1 " 



2^ . 



l)Rm,pixo - Xm)iU){Xm - Xo){U) 

l){Xm - Xo)^{U) 

^)[Rm,,p£]j 

i)[Rm.,p£]i[Rm.,pi^O - Xm)]{ti) 
i)[Rm,p£]iixo - Xm){ti) 



Start with la- 
Write 



Ia+Ib+Ic+Id + Ie+ If 



||i?m,p(xo - X„,){ti)\\l ,j^/p - ||i?™,p(xo - X,n)iti)\\l^q 
1 " 



Note that 



||[i?^.,p - E (i?,„,p)](xo - a;„0(t,)||^_^„/p - || [i?,„,p - E (i?™,p)](xo - 
1 



< 



R. 



Pn 



'771, p 



(■Rrn,p)]|lp (|1(2;0 - X,n){ti)\ 



^ ) 

n,qj I 



Whence from the choice of $\mathrm{pen}_1$, using Lemma 2.4.1 and summing over $m$ we obtain



P(supSUp[||[i?„,p - E {R,„,p)]{xo ~ 2:m)(ii) II n,g«,/p 

771 p 

-|| [^'m.p - E (i?,„,p)](xo - a;m)(ii)lllg - perii (m, Pfc , 5)] ) 
<(5/6. 



(15) 



We then use Lemma 2.4.5 to bound \\{^ {Rm,p) ~ Rm)[xQ ~ Xm]\\n.q = (c+l)n °'em/Pmin and 
achieve the bound of the term la- 
For the term $I_b$ we start by remarking that

n 

q{ti)—Rra,p{xo - Xm){U){Xra - Xo){U) = 

by orthogonality. The term

n 

'^q{U)Rm,pi^Q - Xr,i)iti){ m 

1=1 

is then bounded 

" 1 

[^g(li)-Rm,p(a;o - X,n){U){x„i ~ Xo){ti)f < ^\\xq - Xm\\i,q + -\\Rrn.p[^Q " 
i=l ^ 

and the proof follows as for $I_a$.

For $I_c$, the proof follows from Lemma 2.5.3 below,

P(supsup{- - 1)0^;™ - xof{U)-po{m,p,5)} > 0) < (5/6. (16) 

m p IT' . Pi 

t 

For Id, Lemma 2.4.4 implies that 

P(supsup{i^g(t,)(^^ - l)[^™,pe]- - -^P2im,p,S)} > 0) < S/6. (17) 

m p n _ Pi Pmin 

The term /g follows exactly as for Ai. Finally, as for lb, by orthogonality we only have to bound 
the term 

n 

['^q{ti)[R,n,p£]iix„i - xa){ti)]'^ 
1=1 

whose proof then follows exactly as for Id- 

The proof then follows by gathering the bounds in (14), (15), (16) and (17). □ 

Lemma 2.5.3. Assume that there exists a positive constant $C$ such that $\sup_m e_m < C$ with
$e_m = \|x_0 - x_m\|_\infty$. Assume that the condition [AQ] is satisfied and that $p_k(t_i) > p_{k,\min}$ for all
$i = 1, \ldots, n$. Assume $\mathrm{pen}_0$ to be selected according to (9). Then,

$P(\sup_m \sup_p \{\|x_0 - x_m\|_{n,qw/p}^2 - \|x_0 - x_m\|_{n,q}^2 - \mathrm{pen}_0(m,p,\delta)\} > 0) < \delta/6.$

m p 

Proof. Note that

$\|x_0 - x_m\|_{n,qw/p}^2 - \|x_0 - x_m\|_{n,q}^2 = \frac{1}{n}\sum_{i=1}^n q(t_i)\left(\frac{w_i}{p_i} - 1\right)(x_0 - x_m)^2(t_i).$

Let p* attain the supremum of this expression, so that 

1 1 2^0 ^ XmWn^qwfp* ^ 11^0 ^ ^mWn.q 

= sup { llx-O - X^\\l ,j.^^p -\\X0- XrnWl.q] 
Since x — xq is not random and is uniformly bounded



\v p, J 

= - E 9(^0(^0 " ^n.)\m f - 1)) = 0, 



Whence, from the choice of $\mathrm{pen}_0$ in (9) and using the bounded differences inequality ([4]), we
have

P(supsup{||a;o - x^\\l.q.u,/p - lla^o - x,n\\l,g ~ peno{m,p,S)} > 0) < V — — , 

m p ^ DTTlytTl + i j 

which yields the desired result. □ 
2.6. Error bounds for the general bounded case 

The above procedure can be extended to other frameworks, defined by minimization over $S_0 = \bigcup_m S_m$ of a given loss function

$L_n(x,y,p) = \frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} l(x(t_i), y_i),$

with expectation $L(x) = E(L_n(x,y,p)) = \frac{1}{n}\sum_{i=1}^n q_i E(l(x(t_i), y(t_i)))$. Set, as above, $L_n(x,p) =
E_\varepsilon(L_n(x,y,p)) = \frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} E(l(x(t_i), y(t_i)))$. We will denote $l(x) = E(l(x,y))$. In order to
repeat the proof of section 2.5 it is necessary to control the fluctuations of both $L_n(x,y,p) -
L_n(x',y,p) - [L_n(x,p) - L_n(x',p)]$ and $L_n(x,p) - L_n(x',p) - [L(x) - L(x')]$. The first term typically
requires setting bounds for

$\Delta(m,p) := \sup_{x \in S_m} \frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} [l(x(t_i), y_i) - l(x(t_i))].$

Assuming $l(x,y)$ is uniformly bounded by a constant (which is not the case for the example
presented in section 2.5), standard arguments (see for example [4] for a very thorough discussion)
combining the bounded differences inequality and bounds for Rademacher sums lead to

P(A(m, Pk) ~ + E (2i?„(m, k))] > 0) < 



k{k + l)m(m + 1) 
with 

/ 4^4 iog(2m(m + l)fc(fc + 1)/^) 

and 

/ n 

/ I V — ^ in: 

il{x{ti),yi 



(1 " w- 
sup - "S^ qi—Ui 



where $\sigma_i$ is a sequence of independent (Rademacher) random variables, $P(\sigma = -1) = P(\sigma =
1) = 1/2$, independent of $y_i$. The above discussion yields,

$P(\sup_{m,k} \Delta(m, p_k) - [t_{m,k} + 2E(R_n(m,k))] > 0) < \delta.$ (18)
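
To give a feeling for the Rademacher sums involved, the following sketch estimates by Monte Carlo the quantity $E_\sigma \sup_x \frac{1}{n}\sum_i q_i \frac{w_i}{p_i} \sigma_i\, l(x(t_i), y_i)$. Two simplifications are assumed here and are not taken from the text: the supremum runs over a finite class of candidate functions, and no absolute value is taken inside the supremum. The loss values are toy numbers.

```python
import numpy as np

def rademacher_average(loss_matrix, q, p, w, rng, num_draws=2000):
    """Monte Carlo estimate of E_sigma max_j (1/n) sum_i q_i (w_i/p_i) sigma_i L[j, i],
    where loss_matrix[j, i] = l(x_j(t_i), y_i) for a finite class {x_j}."""
    weighted = loss_matrix * (q * w / p)               # shape (num_functions, n)
    n = weighted.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    sups = (sigma @ weighted.T).max(axis=1) / n        # sup over the class, per draw
    return float(sups.mean())

n = 20
q = np.ones(n)
p = np.full(n, 0.5)
w = (np.random.default_rng(5).uniform(size=n) < p).astype(float)

losses_one = np.full((1, n), 0.5)                              # single function
losses_two = np.vstack([losses_one, np.linspace(0.0, 1.0, n)]) # richer class

r_one = rademacher_average(losses_one, q, p, w, np.random.default_rng(6))
r_two = rademacher_average(losses_two, q, p, w, np.random.default_rng(6))
```

For a single function the average is zero up to Monte Carlo error, and it can only grow when the class is enlarged, which is the complexity effect the term $E(R_n(m,k))$ accounts for.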
For the second term it is then necessary to bound

1 " w 

A'(m,fc):= sup -S^ q^{^ - l)l{x{U)). 

If $l(x)$ is bounded, again combining the bounded differences inequality and bounds for Rademacher
sums leads to

P(A>, P.) - [t,„,. + 2E (Kim, k))] > 0) < 

with 



and 



This yields, 



/4P4 log(2m(m + l)k{k + 1)/S) 



i?'j(m, fc) = E(j I sup — "S^ qi—(Til{x{ti)) 



P(sup A'(m, Pk) - [t„,fe + 2E (P;(m, fc))] > 0) < (5 (19) 

Actually, the bounds in section 2.5 follow from bounding the Rademacher sums in that case by
$R_n(m,k) \le C\,\mathrm{Tr}(R_{m,p_k})$, for a certain constant $C$, using Lemma 2.4.4 (which follows from a
functional exponential inequality proved in [5]) and choosing adequate terms $L_{k,m}$ in order to
ensure convergence of the sum. Hence, it would seem that equation (19) does not add any interesting
information to what has already been discussed. However, the more general setting is important
because (in the bounded case) it allows us to pass from Rademacher sums to V-C dimensions (see
again [4] for a general discussion), which in turn allows us to consider more general solution
spaces (than a countable union of target model spaces $S_m$ and a countable collection of target
probabilities). Bounds in this case would be



P sup P„ a;,y,p - L x >2L/ |^^+2W ^ <^ 20 

where V is the V-C dimension of the class of functions Sq. 

In practice a reasonable alternative is estimating the overall error by cross-validation or
leave-one-out techniques and then choosing $m$ to minimize the error over successive trials of the probability
$\hat{p}$. Recall that, in the procedure of section 2.5, labels are not required to obtain $\hat{p}$. Of course this
requires a stock of "extra" labels, which might not be affordable in the active learning setting.
However, many applications suggest that $\hat{p}$ (or a thresholded version of $\hat{p}$ which eliminates points
with sampling probability $p_i < \eta$, a certain small constant) helps in finding "good" or informative
subsets, over which model selection may be performed. Other intermediate versions of error
bounding include using empirical versions of the V-C dimension. Empirical error minimization
is especially useful for applications where what is required is a subset of very informative sample
points, as for example when deciding which points get extra labels (new laboratory runs, for
example) given that a first set of complete labels is available.
3. Iterative procedure: updating the sampling probabilities

A major drawback of the batch procedure is the appearance of $p_{\min}$ in the denominator of error
bounds, since typically $p_{\min}$ must be small in order for the estimation procedure to be effective.
Indeed, since the expected number of effective samples is given by $E(\sum_i w_i) = \sum_i p_i$, small values
of $p_i$ are required in order to gain in sample efficiency.
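
The trade-off can be made concrete with a small simulation: the expected number of labels is $\sum_i p_i$, while the batch bounds scale with $1/p_{\min}$. The probabilities below are hypothetical.

```python
import numpy as np

def expected_labels(p):
    """E(sum_i w_i) = sum_i p_i for independent w_i ~ Bernoulli(p_i)."""
    return float(np.sum(p))

rng = np.random.default_rng(3)
# Hypothetical scheme: 900 points sampled rarely, 100 points sampled often.
p = np.concatenate([np.full(900, 0.05), np.full(100, 0.9)])

budget = expected_labels(p)                              # 45 + 90 = 135 of 1000
counts = (rng.uniform(size=(5000, p.size)) < p).sum(axis=1)
average_count = float(counts.mean())                     # close to the budget
inverse_pmin = 1.0 / p.min()                             # factor in batch bounds
```

Here only about 13.5% of the labels are requested on average, but the factor $1/p_{\min} = 20$ enters the error bounds of the batch procedure.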

A closer look at the proofs shows it is necessary to improve on the bounds of expressions such
as

$\frac{1}{n}\sum_{i=1}^n q_i \frac{w_i}{p_i} \varepsilon_i (x - x')(t_i)$

or 

Thus, it seems like a reasonable alternative to consider iterative procedures for which at time
$j$, $p_j(i) \sim \max_{x,x' \in S_j} |x(t_i) - x'(t_i)|$ with $S_j$ the current hypothesis space. In what follows we
develop this strategy, adapting the results of [2] from the classification to the regression problem.
Although we continue to work in the setting of model selection over bounded subsets of linearly
spanned spaces, results can be readily extended to other frameworks such as additive models
or kernel models. Once again, we will require certain additional restrictions associated with the
uniform approximation of $x_0$ over the target model space.

More precisely, we start with an initial model set $S$ ($= S_{m_0}$) and set $x^*$ to be the overall 
minimizer of the loss function $L(x)$ over $S$. Assume additionally 

AU: $\sup_{x \in S} \max_{t \in \{t_1,\dots,t_n\}} |x_0(t) - x(t)| < B$ 

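Let $L_n(x) = L_n(x, y, p)$ and $L(x)$ be as in section 2.5. For the iterative procedure introduce 
the notation 

$$L_j(x) = \frac{1}{n_0 + j}\sum_{\ell=1}^{n_0+j} q_{i_\ell}\,\frac{w_\ell}{p_\ell}\,\big(x(t_{i_\ell}) - y_{i_\ell}\big)^2$$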

In the setting of Section 2, for each $0 \le j \le n$, $S_j$ will be the linear space spanned by the collection 
$\{\phi_\ell\}_{\ell \in I_j}$, with $|I_j| = d_j$. 

In order to bound the fluctuations of the initial step in the iterative procedure we consider 
the quantities defined in equations (5) and (7) for $r = \gamma = 2$. That is, 

Ao ^ 2.^Qy2(do + l)^V(2/^)l 



no no 



with 



,(VT7 + 1) / doQ 



2 y noPn 

As discussed in section 2.4, $\Delta_0$ requires some initial guess of $\|x_0 - x_{m_0}\|^2_{n,q}$. Since this is not 
available, we consider the upper bound $B^2$. Of course this will possibly slow down the initial 
convergence, as $\Delta_0$ might be too big, but it will not affect the overall algorithm. Note also that we 
do not consider the weighting sequence $L_k$ of equation (7) because the sampling probability is 
assumed fixed. 

Next set $B_j = \sup_{x,x' \in S_{j-1}} \max_{t \in \{t_1,\dots,t_n\}} |x(t) - x'(t)|$ and define 
2(rf,+l) ^ log\4ino+j){no+j + l)/S) 
no +j no+ j 



/log(4(no+j)(no+j + l)/(5)- 



16B2(2Bj A 1)2Q2 




no+j V "0 + j 

The iterative procedure is stated as follows: 

1. For $j = 0$: 

• Choose (randomly) an initial sample of size $n_0$, $M_0 = \{t_{i_1}, \dots, t_{i_{n_0}}\}$. 

• Let $\hat{x}_0$ be the solution chosen by minimization of $L_0(x)$ (or possibly a weighted version 
of this loss function). 

• Set $S_0 \subset \{x \in S : L_0(x) < L_0(\hat{x}_0) + \Delta_0\}$ 

2. At step $j$: 

• Select (randomly) a sample candidate point $t_{i_j}$, $t_{i_j} \notin M_{j-1}$. Set $M_j = M_{j-1} \cup \{t_{i_j}\}$ 

• Set $p_j = (\max_{x,x' \in S_{j-1}} |x(t_{i_j}) - x'(t_{i_j})| \wedge 1)$ and generate $w_j \sim \mathrm{Ber}(p_j)$. If $w_j = 0$, set 
$j = j + 1$ and go to (2) to choose a new sample candidate. If $w_j = 1$, sample $y_{i_j}$ and 
continue. 

• Let $\hat{x}_j = \arg\min_{x \in S_{j-1}} L_j(x) + \Delta_{j-1}(x)$ 

• Set $S_j \subset \{x \in S_{j-1} : L_j(x) < L_j(\hat{x}_j) + \Delta_j\}$ 

• Set $j = j + 1$ and go to (2) to choose a new sample candidate. 
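
The two steps above can be sketched in code under several simplifying assumptions, none of which come from the text: the hypothesis space is replaced by a finite list of candidate functions, the weights $q_i$ are all taken equal to 1, the threshold is a simplified stand-in of the form $c_0\sqrt{\log(4m(m+1)/\delta)/m}$ with $m = n_0 + j$ rather than the actual $\Delta_j$, and the penalty term inside the argmin is dropped. The names `run_iterative_sampling`, `label` and `c0` are hypothetical.

```python
import math
import random

def run_iterative_sampling(t, label, hypotheses, n0, delta, c0, rng):
    """Sketch of the two-step iteration over a finite hypothesis list.

    t          : list of design points
    label      : callable i -> y_i, only called when a label is actually bought
    hypotheses : list of callables x(t); finite stand-in for the space S
    c0         : scale of the simplified threshold (assumption, not the paper's Delta_j)
    """
    order = list(range(len(t)))
    rng.shuffle(order)
    # Step 0: initial sample of size n0, every point labelled (p = 1).
    records = [(i, 1.0, 1, label(i)) for i in order[:n0]]   # (index, p, w, y)
    n_labels = n0

    def loss(x):
        # Importance-weighted squared loss over all candidates seen so far.
        total = sum((w / p) * (y - x(t[i])) ** 2 for i, p, w, y in records if w)
        return total / len(records)

    def threshold():
        m = len(records)
        return c0 * math.sqrt(math.log(4 * m * (m + 1) / delta) / m)

    def shrink(space):
        # Empirical minimizer over the current space, then the sub-level set around it.
        best = min(space, key=loss)
        cut = loss(best) + threshold()
        return best, [x for x in space if loss(x) <= cut]

    x_hat, space = shrink(hypotheses)
    for i in order[n0:]:
        values = [x(t[i]) for x in space]
        p = min(1.0, max(values) - min(values))   # max |x(t_i) - x'(t_i)|, capped at 1
        w = 1 if rng.random() < p else 0
        if w == 0:
            records.append((i, p, 0, None))       # candidate skipped, no label bought
            continue
        records.append((i, p, 1, label(i)))
        n_labels += 1
        x_hat, space = shrink(space)
    return x_hat, space, n_labels

# Hypothetical usage: lines through the origin, true slope 0.5, small Gaussian noise.
rng = random.Random(1)
t = [k / 199 for k in range(200)]
noise = [rng.gauss(0.0, 0.05) for _ in t]
label = lambda i: 0.5 * t[i] + noise[i]
hypotheses = [(lambda s, c=c: c * s) for c in [k / 20 for k in range(21)]]
x_hat, space, n_labels = run_iterative_sampling(t, label, hypotheses, 20, 0.1, 0.05, rng)
```

As the version space shrinks, the disagreement between surviving hypotheses at a candidate point decreases, so later candidates are labelled with smaller probability. This is the mechanism by which the procedure is expected to save labels.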

Note that, as stated, the procedure can continue only up until time $n$ (when there 
are no more points to sample). If the process is stopped at time $T < n$, the term $\log(n(n+1))$ 
can be replaced by $\log(T(T+1))$. 

Also, instead of the term 4a/4 ^'^^"''"'"V°^" in the second line of the definition of A,-, we could 
have used $4R_j$, the associated Rademacher sum. Recall that the Rademacher sum over the class $S_{j-1}$ 
is given by 



Rj = Eo- sup 

As discussed in section 2.6, Rj < Rq < \J ^^°sM^s ^ t^i^q^-q jg fj^Q Y-C dimension of the class of 

functions $\mathcal{F} = \{f = (x - x_0)^2,\ x \in S\}$. The quantity $2(d_j + 1)$ in the definition of $\Delta_j$ is obtained 
by using well-known properties of the V-C dimension, using that $x^2$ is a convex function and that $d_j + 1$ 
is the V-C dimension of the linear space $S_j - x_0$, to obtain the stated weight. 
We have the following result, in the spirit of Theorem 2 in [2]. 

Theorem 3.0.1. Let $x^* = \arg\min_{x \in S} L(x)$. Set $\delta > 0$. Then, with probability at least $1 - \delta$, for 
any $j \le n$ 

• $|L(x) - L(x')| \le 2\Delta_{j-1}$, for all $x, x' \in S_j$ 

• $L(\hat{x}_j) \le L(x^*) + 2\Delta_{j-1}$ 

An important issue is related to the initial choice of $m_0$ and $n_0$. As the overall precision of the 
algorithm is determined by $L(x^*)$, it is important to select a sufficiently complex initial model 
collection. However, if $d_0 \gg n_0$ then $\Delta_0$ can be big and $p_j \approx 1$ for the first samples, which 
leads to a more inefficient sampling scheme. 

Proof of Theorem 3.0.1: the proof is based on the following preliminary Lemma. 
Lemma 3.0.2. For any $\delta > 0$, with probability at least $1 - \delta$, for all $j \le n$ and all $x, x' \in S_{j-1}$ 

$$|L_j(x) - L_j(x') - [L(x) - L(x')]| \le \Delta_j.$$

Set $\delta > 0$, so from Lemma 3.0.2 

$$|L_j(x) - L_j(x') - [L(x) - L(x')]| \le \Delta_j$$

holds for all $0 \le j \le n$, $x, x' \in S_{(j-1) \vee 0}$ with probability at least $1 - \delta$. Hence, for any $x, x' \in S_{j-1}$, 
since $S_{j-1} \subset S_{j-2}$, from the definition of $S_{j-2}$ 

$$L(x) - L(x') \le L_{j-1}(x) - L_{j-1}(x') + \Delta_{j-1}.$$

On the other hand, over the chosen event with probability greater than $1 - \delta$, by the choice 
of $\Delta_0$ and the results in section 2.4, $x^* \in S_0$ from the definition of $S_0$. We shall now prove by 
induction that over the stated event $x^* \in S_j$ for $1 \le j \le n$. Assume $x^* \in S_{j-2}$. By Lemma 3.0.2, 

$$L_{j-1}(x^*) - L_{j-1}(\hat{x}_{j-1}) \le L(x^*) - L(\hat{x}_{j-1}) + \Delta_{j-1} \le \Delta_{j-1},$$

so that $x^* \in S_{j-1}$, which ends the proof by induction. Whence, for all $1 \le j \le n$, $L(\hat{x}_j) \le 
L(x^*) + 2\Delta_{j-1}$, which ends the proof of the Theorem. 

Proof of lemma 3.0.2: 

For fixed j and any x, x' G S'j-i we have 

L,(x)-L,(x')-\L{x)~L{x')\ 
1 w 

= — - - x){ii/)[x + x' - 2xo)(i^J 

J + '^o ^ Vi 

2 \ - w, 



^I,+II,. 



Eqi—£i{x{ti^) - x'{tij) 
ri: 



Set $Z_i = q_i w_i (x - x')(t_i)(x + x' - 2x_0)(t_i)$, so that $\|Z_i\|_\infty \le 2QB_j(2B_j \wedge 1)$ and $I_j$ satisfies the 
bounded difference inequality with $c_i = 4Q^2B_j^2(2B_j \wedge 1)^2$. By equation (19) in section 2.6 we 
have 



P{I, > Wlog(4K + j)(no + .7 + 1)/S) + ^U ^' ' ) 



< P{I, > Wlog(4(no + 3){no + J + l)/5) ' ' + 4E {R,)) 

V "0 + J 

$\le \delta / \big(2(n_0+j)(n_0+j+1)\big)$. 
Next we deal with II Set u{t) = so that <liu'^itu) < 1- Then, 



< sup —y^QiUiSt 

= VQW^Sj-i^Wno+j.q, 
where $\Pi_{S_j}$ stands for the projection over $S_j$. Whence by Lemma 2.4.4, 



' ' y ^^^^ ^^^j 

$\le \delta / \big(2(n_0+j)(n_0+j+1)\big)$. 
Summing over j ends the proof. 
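
The summation step can be made explicit as follows. This is only a sketch, under the reading that each of the two tail events at step $j$ (one for $I_j$, one for $II_j$) has probability at most $\delta/(2(n_0+j)(n_0+j+1))$ and that $n_0 \ge 1$; the sum then telescopes.

```latex
\sum_{j=0}^{n} 2\cdot\frac{\delta}{2(n_0+j)(n_0+j+1)}
  = \delta \sum_{j=0}^{n} \left( \frac{1}{n_0+j} - \frac{1}{n_0+j+1} \right)
  = \delta \left( \frac{1}{n_0} - \frac{1}{n_0+n+1} \right)
  \le \frac{\delta}{n_0} \le \delta .
```

The weights $(n_0+j)(n_0+j+1)$ inside the logarithms of the thresholds are therefore what make the union bound over all steps sum to at most $\delta$.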



3.1. Effective sample size 

For any sampling scheme the expected number of effective samples is, as already mentioned, 
$E(\sum_i p_i)$. Whenever the sampling policy is fixed, this sum is not random and effective reduction 
of the sample size will depend on how small the sampling probabilities are. However, this will increase 
the error bounds as a consequence of the factor $1/p_{\min}$. The iterative procedure allows a closer 
control of both aspects, and under suitable conditions $p_j$ will be of order $\sqrt{L(x^*) + \Delta_j}$, as 
we will prove below by establishing appropriate bounds over the random sequence $p_j$. Recall that from 
the definition of the iterative procedure we have $p_j(i) \sim \max_{x,x' \in S_j} |x(t_i) - x'(t_i)|$, whence the 
expected number of effective samples is of the order of $\sum_j \max_{x,x' \in S_j} |x(t_{i_j}) - x'(t_{i_j})|$. It is then 
necessary to control $\sup_{x,x' \in S_{j-1}} |x(t_i) - x'(t_i)|$ in terms of the (quadratic) empirical loss function 
$L_j$. For this we must introduce some notation and results relating the supremum and $L_2$ norms 
([3]). 

Let $S \subset L_2 \cap L_\infty$ be a linear subspace of dimension $d$, with basis $\Phi := \{\phi_j,\ j \in m_S\}$, $|m_S| = d$. 

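Set $\eta(S) := \frac{1}{\sqrt{d}} \sup_{x \in S,\, x \ne 0} \|x\|_\infty / \|x\|_2$ and $r := \inf_\Lambda r_\Lambda$, with 
$r_\Lambda := \frac{1}{\sqrt{d}} \sup_{\beta \ne 0} \big\| \sum_{\lambda \in \Lambda} \beta_\lambda \phi_\lambda \big\|_\infty / |\beta|_\infty$, where $\Lambda$ stands 
for any orthonormal basis of $S$. We require the following result in [3]: 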

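Lemma 3.1.1. (Lemma 1 [3]) Let $S$ be a $d$-dimensional linear subspace of $L_2 \cap L_\infty$, with basis 
$\Phi$, and set $\eta(S) := \frac{1}{\sqrt{d}} \sup_{t \in S} \|t\|_\infty / \|t\|_2$. Then 

1. $\eta(S) = \frac{1}{\sqrt{d}} \big\| \sum_{\phi_\lambda \in \Phi} \phi_\lambda^2 \big\|_\infty^{1/2}$ 

2. $\eta(S) \le r \le \eta(S)\sqrt{d}$ 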

Example 3.1.2. Some examples of r for typical linear settings include ([3], pp 337-338): 

1. Trigonometric expansions: r < V2d. 

2. Polynomials: r < d. 

3. Localized basis: 

• $\{\phi_j = \sqrt{d}\,\mathbf{1}_{[(j-1)/d,\ j/d]}\}_{1 \le j \le d}$: $r \le 1$ 

• Piecewise polynomials on [0, 1] of degree m: r < 2m + 1 

• Orthonormal wavelet systems: r < C , for a certain constant C depending on the form 
of the basis. 
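
As a small numerical check of the first localized example, the sketch below evaluates $\eta(S)$ for the histogram basis $\phi_j = \sqrt{d}\,\mathbf{1}_{[(j-1)/d,\, j/d)}$ on $[0,1)$. It assumes the identity $\eta(S) = d^{-1/2} \| \sum_\lambda \phi_\lambda^2 \|_\infty^{1/2}$ for an orthonormal basis, and approximates the supremum over a finite grid. The function names and the grid are illustrative choices.

```python
import math

def histogram_basis(d):
    """Return the d functions phi_j = sqrt(d) * 1_[(j-1)/d, j/d) on [0, 1)."""
    def make(j):
        return lambda s: math.sqrt(d) if (j - 1) / d <= s < j / d else 0.0
    return [make(j) for j in range(1, d + 1)]

def eta(basis, grid):
    """(1/sqrt(d)) * sup_t sqrt(sum_j phi_j(t)^2), with the sup taken over a grid."""
    d = len(basis)
    sup_sq = max(sum(phi(s) ** 2 for phi in basis) for s in grid)
    return math.sqrt(sup_sq / d)

d = 8
grid = [(k + 0.5) / 1000 for k in range(1000)]   # midpoints, so no point sits on a cell boundary
eta_hist = eta(histogram_basis(d), grid)          # numerically equal to 1 for this basis
```

Since exactly one $\phi_j$ is non-zero at each point and takes the value $\sqrt{d}$ there, $\sum_j \phi_j^2 \equiv d$ and $\eta(S) = 1$, in agreement with the bound $r \le 1$ quoted for this basis.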

We have the following result 

Lemma 3.1.3. Let $\hat{x}_j$ be the sequence of iterative approximations to $x^*$ and $p_t(j)$ be the sampling 
probabilities in each step of the iteration, $j = 1,\dots,T$. Then, the effective number of samples, 
that is, the expectation of the required samples $n_e$, is bounded by 

$$2\sqrt{2}\, r \Big( \sqrt{L(x^*)} \sum_{j=1}^{T} \sqrt{d_j} + \sum_{j=1}^{T} \sqrt{d_j \Delta_j} \Big).$$
Proof. We have 

$$\begin{aligned}
p_t(j)^2 &\le \sup_{x,x' \in S_j} \|x - x'\|_\infty^2 \le 4 \sup_{x \in S_j} \|x - x^*\|_\infty^2 \\
&\le 4r^2 d_j \sup_{x \in S_j} L(x - x^*) \\
&\le 4r^2 d_j \sup_{x \in S_j} [L(x) + L(x^*)] \\
&\le 4r^2 d_j \big(2L(x^*) + 2\Delta_j\big).
\end{aligned}$$

The third inequality follows from Lemma 3.1.1 and the fifth from the bound $\sup_{x \in S_j} L(x) \le 
L(x^*) + 2\Delta_j$, as follows from Theorem 3.0.1. The proof is achieved by calculating the square root 
of each side of the last series of inequalities and finally using that $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$. 

□ 
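
To get a feeling for the size of this bound, the following sketch evaluates $2\sqrt{2}\,r\big(\sqrt{L(x^*)}\sum_j \sqrt{d_j} + \sum_j \sqrt{d_j \Delta_j}\big)$, the expression obtained by summing the square root of the last inequality over $j = 1,\dots,T$. The dimensions, thresholds and the value of $r$ used below are hypothetical inputs chosen only for illustration.

```python
import math

def effective_sample_bound(r, loss_star, dims, deltas):
    """2*sqrt(2)*r * ( sqrt(L(x*)) * sum_j sqrt(d_j) + sum_j sqrt(d_j * Delta_j) )."""
    if len(dims) != len(deltas):
        raise ValueError("dims and deltas must have the same length")
    first = math.sqrt(loss_star) * sum(math.sqrt(d) for d in dims)
    second = sum(math.sqrt(d * delta) for d, delta in zip(dims, deltas))
    return 2.0 * math.sqrt(2.0) * r * (first + second)

# Hypothetical inputs: fixed dimension 4, thresholds decaying like 0.01 / j, T = 100 steps,
# and a target function lying in the model space, so that L(x*) = 0.
T = 100
dims = [4] * T
deltas = [0.01 / j for j in range(1, T + 1)]
bound = effective_sample_bound(r=1.0, loss_star=0.0, dims=dims, deltas=deltas)
```

With these inputs the bound is between 10 and 11 labels out of 100 candidate points. When $L(x^*) > 0$ the first term grows linearly in $T$, so the saving is governed by the approximation error of the model space.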



4. Appendix 

Proof. (Lemma 2.4.2) 

The proof follows closely the ideas of the proof of Theorem 7.3, p. 62 in [8]. Let $A \in \mathbb{R}^{n \times m}$ be a 
matrix, with rows $a(l) \in \mathbb{R}^m$, $l = 1,\dots,n$, satisfying 

$$\|a(l)\|_2 \le K\sqrt{m} \qquad (21)$$

for some constant $K \ge 1$. Recall $A^*A = \sum_{l=1}^{n} a(l)a(l)^*$. 
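
The decomposition of $A^*A$ as a sum of rank-one terms, on which the symmetrization argument relies, can be checked on a toy example. The sketch below assumes a real matrix (so that the adjoint is the transpose); the matrix `A` and the helper names are hypothetical.

```python
def gram_from_rows(rows):
    """Sum of the outer products a(l) a(l)^T over the rows a(l) of A."""
    m = len(rows[0])
    g = [[0.0] * m for _ in range(m)]
    for a in rows:
        for i in range(m):
            for k in range(m):
                g[i][k] += a[i] * a[k]
    return g

def gram_direct(rows):
    """A^T A computed entrywise as sum_l A[l][i] * A[l][k]."""
    m = len(rows[0])
    return [[sum(a[i] * a[k] for a in rows) for k in range(m)] for i in range(m)]

A = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # n = 3 rows in R^2
G = gram_from_rows(A)                        # [[10.0, 5.0], [5.0, 6.0]]
```

Both routines return the same matrix, which is the reason the deviation of $A^*A$ from its expectation can be treated as a sum of independent rank-one terms when the rows are independent.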

For the first part of the lemma we must bound E,, = E (^Wj^^A^ A - E {A*A))\\'pj , where 
:= (^"^^) lip- Using the symetrization Lemma (see [8]), we have for all 2 < r < oo, 

Er<iJ^yE{\\eMl)ailf\\;). 

where $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is a Rademacher sequence independent of $a(1), \dots, a(n)$. 
Thus, the following inequality holds 

Er<(^X2'/''mr^/'e-^/'-E f P||: inax ||a(/)||! 

\nAA J \ ' l=l,...,n 



2 > 



where we have used Rauhut's Lemma 6.18, p. 46 [8] (which is a version of Rudelson's Lemma 
[9]) and the Cauchy-Schwarz inequality to obtain the stated result. 

Furthermore, using the bound (21) and applying the triangle inequality yields 

Efpii; max ||a(0||^ 

\ / — l,...,n 

< 



^E(prA|l;;)E(^^maxJ|a(0|| 
W/2(nAA)'"/'((E {\\-^{A^A-E{A'A)w\ f/'^ + lY'^, (22) 



< K 



where we have used ||E [l/nA^A) / A^jlp = 1. Now, recall 



V \/nAA V " 
Then, using the inequality (22) 

Whence, squaring the last inequality and completing squares yields 




In the following we assume that Sr.m.n < 1/2. Thus, 

where $t = (\sqrt{17} + 1)/4$. 

For the second part of the lemma we want to bound in probability || — E (A*A))||p. 

The proof then follows directly from the first part of the lemma using the Markov inequality. 

□ 